In this notebook we ingest and visualize the mobility trends data provided by Apple, [APPL1].
We take the following steps:
Download the data
Import the data and summarise it
Transform the data into long form
Partition the data into subsets that correspond to combinations of geographical regions and transportation types
Make contingency matrices and corresponding heat-map plots
Make nearest neighbors graphs over the contingency matrices and plot communities
Plot the corresponding time series
About This Data
The CSV file and charts on this site show a relative volume of directions requests per country/region or city compared to a baseline volume on January 13th, 2020. We define our day as midnight-to-midnight, Pacific time. Cities represent usage in greater metropolitan areas and are stably defined during this period. In many countries/regions and cities, relative volume has increased since January 13th, consistent with normal, seasonal usage of Apple Maps. Day of week effects are important to normalize as you use this data. Data that is sent from users’ devices to the Maps service is associated with random, rotating identifiers so Apple doesn’t have a profile of your movements and searches. Apple Maps has no demographic information about our users, so we can’t make any statements about the representativeness of our usage against the overall population.
The observations listed in this subsection are also placed under the relevant statistics in the following sections and indicated with “Observation”.
The reference date for normalizing the directions-request volumes is 2020-01-13: all the values in that column are \(100\).
From the community clusters of the nearest neighbor graphs (derived from the time series of the normalized driving directions requests volume) we see that countries and cities are clustered in expected ways. For example, in the community graph plot corresponding to “{city, driving}” the cities Oslo, Copenhagen, Helsinki, Stockholm, and Zurich are placed in the same cluster. In the graphs corresponding to “{city, transit}” and “{city, walking}” the Japanese cities Tokyo, Osaka, Nagoya, and Fukuoka are clustered together.
In the time series plots the Sundays are indicated with orange dashed lines. From Monday to Thursday the volumes are lower than on, say, Fridays and Saturdays, which suggests that people are more familiar with their weekday trips. On Sundays people are (on average) also more familiar with their trips, or they simply travel less.
library(Matrix)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(d3heatmap)
library(igraph)
library(zoo)
library(forecast)
Apple mobility data was provided at this web page: https://www.apple.com/covid19/mobility , [APPL1]. (The data has to be downloaded from that web page; there is an “agreement to terms”, etc.)
dfAppleMobility <- read.csv( "~/Downloads/applemobilitytrends-2021-09-19.csv", stringsAsFactors = FALSE)
#dfAppleMobility <- read.csv( "~/Downloads/applemobilitytrends-2021-02-20.csv", stringsAsFactors = FALSE)
#dfAppleMobility <- read.csv("https://covid19-static.cdn-apple.com/covid19-mobility-data/2024HotfixDev18/v3/en-us/applemobilitytrends-2021-01-15.csv")
names(dfAppleMobility) <- gsub( "^X", "", names(dfAppleMobility))
names(dfAppleMobility) <- gsub( ".", "-", names(dfAppleMobility), fixed = TRUE)
dfAppleMobility
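To see what the two gsub calls above do, here is a minimal sketch on made-up column names. It assumes that read.csv mangles the original headers the way make.names does (prefixing an "X" to names that start with a digit and replacing invalid characters with dots); lsRawNames and lsCleanNames are hypothetical names used only for illustration.

```r
# Column names as read.csv (with its default check.names = TRUE) would produce them
lsRawNames <- make.names( c("geo_type", "sub-region", "2020-01-13") )
lsRawNames

# Undo the mangling: drop the leading "X", then turn the dots back into dashes
lsCleanNames <- gsub( "^X", "", lsRawNames )
lsCleanNames <- gsub( ".", "-", lsCleanNames, fixed = TRUE )
lsCleanNames
```

A side effect of this approach is that any header that legitimately starts with "X" or contains a dot would be changed as well; reading the file with check.names = FALSE would avoid the mangling altogether.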
Observation: The reference date for normalizing the directions-request volumes is 2020-01-13: all the values in that column are \(100\).
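As a sketch of what this normalization means, assume (this is my reading of Apple's description, not a documented formula) that each value is the day's request volume expressed as a percentage of the volume on the baseline date. The raw counts below are made up.

```r
# Hypothetical raw daily counts of directions requests for one region
vRawVolume <- c( "2020-01-13" = 250, "2020-01-14" = 260, "2020-01-15" = 240 )

# Relative volume: percentage of the baseline day's volume (assumed formula)
vRelVolume <- 100 * vRawVolume / vRawVolume[["2020-01-13"]]
vRelVolume
```

With the data frame loaded above, the observation itself could be checked with all( dfAppleMobility[["2020-01-13"]] == 100, na.rm = TRUE ).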
Data dimensions:
dim(dfAppleMobility)
[1] 4691 622
Data summary:
summary(as.data.frame(unclass(dfAppleMobility[,1:3]), stringsAsFactors = TRUE))
geo_type              region                   transportation_type
city          : 790   Washington County:  27   driving:3048
country/region: 153   Jefferson County :  25   transit: 551
county        :2638   Montgomery County:  24   walking:1092
sub-region    :1110   Franklin County  :  22
                      Madison County   :  21
                      Jackson County   :  19
                      (Other)          :4553
Number of unique “country/region” values:
dfAppleMobility %>%
dplyr::filter( geo_type == "country/region") %>%
dplyr::pull("region") %>%
unique %>%
length
[1] 63
Number of unique “city” values:
dfAppleMobility %>%
dplyr::filter( geo_type == "city") %>%
dplyr::pull("region") %>%
unique %>%
length
[1] 295
All unique geo types:
lsGeoTypes <- unique(dfAppleMobility[["geo_type"]])
lsGeoTypes
[1] "country/region" "city" "sub-region" "county"
All unique transportation types:
lsTransportationTypes <- unique(dfAppleMobility[["transportation_type"]])
lsTransportationTypes
[1] "driving" "walking" "transit"
It is better to have the data in long form (narrow form). For that I am using the package “tidyr”.
# lsIDColumnNames <- c("geo_type", "region", "transportation_type") # For the initial dataset released by Apple.
lsIDColumnNames <- c("geo_type", "region", "transportation_type", "alternative_name", "sub-region", "country" )
dfAppleMobilityLongForm <- tidyr::pivot_longer( data = dfAppleMobility, cols = setdiff( names(dfAppleMobility), lsIDColumnNames), names_to = "Date", values_to = "Value" )
dim(dfAppleMobilityLongForm)
[1] 2889656 8
Remove the rows with “empty” values:
dfAppleMobilityLongForm <- dfAppleMobilityLongForm[ complete.cases(dfAppleMobilityLongForm), ]
dim(dfAppleMobilityLongForm)
[1] 2853808 8
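The following toy example, with made-up values and hypothetical names (dfWide, dfLong, lsToyIDColumnNames), sketches the same two steps on a tiny data frame: pivoting the date columns into “Date” and “Value”, and then dropping the incomplete rows.

```r
library(tidyr)

# Toy wide-form data: two ID columns and three date columns
dfWide <- data.frame(
  region = c("A", "B"),
  transportation_type = c("driving", "walking"),
  `2020-01-13` = c(100, 100),
  `2020-01-14` = c(101.5, NA),
  `2020-01-15` = c(98.2, 97.0),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
lsToyIDColumnNames <- c("region", "transportation_type")

# Every non-ID column becomes a (Date, Value) pair in its own row
dfLong <- tidyr::pivot_longer(
  data = dfWide,
  cols = setdiff( names(dfWide), lsToyIDColumnNames ),
  names_to = "Date",
  values_to = "Value"
)
dim(dfLong)

# Drop the rows that have a missing value
dfLongComplete <- dfLong[ complete.cases(dfLong), ]
dim(dfLongComplete)
```

The same arithmetic explains the dimensions reported above: 4691 rows times 616 date columns (622 columns minus the 6 ID columns) gives 2889656 long-form rows before the incomplete ones are removed.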
Add the “DateObject” column:
dfAppleMobilityLongForm$DateObject <- as.POSIXct( dfAppleMobilityLongForm$Date, format = "%Y-%m-%d", origin = "1970-01-01" )
Add “day name” (“day of the week”) field:
dfAppleMobilityLongForm$DayName <- weekdays(dfAppleMobilityLongForm$DateObject)
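A small check of the date handling, as a sketch: the baseline date 2020-01-13 should come out as a Monday and the last date in the file, 2021-09-19, as a Sunday. Note that weekdays() returns names in the session's locale, so the English day names used in this notebook are an assumption about that locale; the ISO weekday number obtained with format(..., "%u") does not depend on it. The explicit tz = "UTC" is also an addition made for the illustration.

```r
vDates <- as.POSIXct( c("2020-01-13", "2021-09-19"), format = "%Y-%m-%d", tz = "UTC" )

# Locale-dependent day names ("Monday" and "Sunday" in an English locale)
weekdays(vDates)

# Locale-independent ISO weekday numbers: 1 is Monday, 7 is Sunday
format( vDates, "%u" )
```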
Here is a sample of the transformed data:
set.seed(3232)
dfAppleMobilityLongForm %>% dplyr::sample_n( 10 )
Here is a summary:
summary(as.data.frame(unclass(dfAppleMobilityLongForm), stringsAsFactors = TRUE))
geo_type                 region                      transportation_type
city          : 484278   Washington County:  16561   driving:1844517
country/region:  93789   Jefferson County :  15337   transit: 338421
county        :1618190   Montgomery County:  14730   walking: 670870
sub-region    : 657551   Franklin County  :  13490
                         Madison County   :  12879
                         Jackson County   :  11653
                         (Other)          :2769158

alternative_name                  sub.region           country                 Date
                       :2229152             : 812035   United States:1902810   2020-01-13:   4652
AB                     :   1843   Texas     : 147809   Japan        : 135661   2020-01-14:   4652
ACT                    :   1843   California: 101866                :  93789   2020-01-15:   4652
Andalucía              :   1843   Georgia   :  80345   France       :  55238   2020-01-16:   4652
Bayern                 :   1843   Virginia  :  75441   Germany      :  52764   2020-01-17:   4652
BC|Colombie-Britannique:   1843   Florida   :  74259   Thailand     :  41696   2020-01-18:   4652
(Other)                : 615441   (Other)   :1562053   (Other)      : 571850   (Other)   :2825896

Value             DateObject                    DayName
Min.   :   0.44   Min.   :2020-01-13 00:00:00   Friday   :404724
1st Qu.:  91.77   1st Qu.:2020-06-15 00:00:00   Monday   :405790
Median : 123.72   Median :2020-11-16 00:00:00   Saturday :409376
Mean   : 133.41   Mean   :2020-11-15 18:05:16   Sunday   :409376
3rd Qu.: 162.47   3rd Qu.:2021-04-19 00:00:00   Thursday :409376
Max.   :3303.05   Max.   :2021-09-19 00:00:00   Tuesday  :405790
                                                Wednesday:409376
Partition the data into geo types × transportation types:
dfAppleMobilityLongForm %>%
dplyr::group_by( geo_type, transportation_type) %>%
dplyr::count()
aQueries <- split(dfAppleMobilityLongForm, dfAppleMobilityLongForm[,c("geo_type", "transportation_type")] )
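To illustrate what split returns here, a minimal sketch with made-up data (dfToy and aToyQueries are hypothetical names): splitting by two columns produces one data frame per combination of their values, named by joining the two values with a dot. Combinations that do not occur in the data yield empty data frames.

```r
dfToy <- data.frame(
  geo_type = c("city", "city", "county", "county"),
  transportation_type = c("driving", "walking", "driving", "driving"),
  Value = c(100, 110, 95, 105),
  stringsAsFactors = FALSE
)

# One data frame per (geo_type, transportation_type) combination
aToyQueries <- split( dfToy, dfToy[, c("geo_type", "transportation_type")] )
names(aToyQueries)

# Number of rows per combination; "county.walking" does not occur, so it has 0 rows
sapply( aToyQueries, nrow )
```

Empty combinations like this one are presumably the reason why matrices without rows are filtered out after the cross-tabulation below.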
We can visualize the data using heat-map plots.
Remark: Using the contingency matrices prepared for the heat-map plots we can do further analysis, like, finding correlations or nearest neighbors. (See below.)
Cross-tabulate dates with regions:
aMatDateRegion <- purrr::map( aQueries, function(dfX) { xtabs( formula = Value ~ Date + region, data = dfX, sparse = TRUE ) } )
aMatDateRegion <- aMatDateRegion[ purrr::map_lgl(aMatDateRegion, function(x) nrow(x) > 0 ) ]
dfPlotQuery <- purrr::map_df( aMatDateRegion, Matrix::summary, .id = "Type" )
head(dfPlotQuery)
613 x 295 sparse Matrix of class "dgCMatrix", with 180835 entries
Type i j x
1 city.driving 1 1 100.00
2 city.driving 2 1 100.73
3 city.driving 3 1 102.86
4 city.driving 4 1 102.65
5 city.driving 5 1 109.39
6 city.driving 6 1 109.62
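Here is a minimal sketch of the cross-tabulation and of the triplet form used for plotting, on made-up data (dfToy, matToy, and dfTriplets are hypothetical names). Since each date-region pair occurs only once, the sums computed by xtabs are just the values themselves.

```r
library(Matrix)

dfToy <- data.frame(
  Date = c("2020-01-13", "2020-01-14", "2020-01-13", "2020-01-14"),
  region = c("A", "A", "B", "B"),
  Value = c(100, 103.5, 100, 98),
  stringsAsFactors = FALSE
)

# Sparse contingency matrix: rows are dates, columns are regions
matToy <- xtabs( Value ~ Date + region, data = dfToy, sparse = TRUE )
matToy

# Triplet form: row index i, column index j, value x
dfTriplets <- Matrix::summary(matToy)
dfTriplets
```

Note that the tile plot below uses these integer indices i and j as coordinates, not the dates and region names.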
ggplot2::ggplot(dfPlotQuery) +
ggplot2::geom_tile( ggplot2::aes( x = j, y = i, fill = log10(x)), color = "white") +
ggplot2::scale_fill_gradient(low = "white", high = "blue") +
ggplot2::xlab("Region") + ggplot2::ylab("Date") +
ggplot2::facet_wrap( ~Type, scales = "free", ncol = 2)
Here we take a “closer look” at one of the plots using a dedicated d3heatmap plot:
d3heatmap::d3heatmap( x = aMatDateRegion[["country/region.driving"]], Rowv = FALSE )
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette RdYlBu is 11
Returning the palette you asked for with that many colors
Here we create nearest neighbor graphs of the contingency matrices computed above, cluster the nodes, and plot the clusters:
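# Correlation threshold: only region pairs with a correlation above it are linked
th <- 0.94
aNNGraphs <-
  purrr::map( aMatDateRegion, function(m) {
    # Correlations between the regions' time series (the matrix columns)
    m2 <- cor(as.matrix(m))
    # Remove the self-correlations so that the graph has no loops
    diag(m2) <- 0
    m2 <- as( m2, "dgCMatrix")
    # Keep only the correlations above the threshold (they become edge weights)
    m2@x[ m2@x <= th ] <- 0
    #m2@x[ m2@x > th ] <- 1
    igraph::graph_from_adjacency_matrix(Matrix::drop0(m2), weighted = TRUE, mode = "undirected")
  })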
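# Index of the graph to examine (one geo type × transportation type combination)
ind <- 3
# Find communities by edge betweenness
ceb <- igraph::cluster_edge_betweenness(aNNGraphs[[ind]])
# Dendrogram of the communities (plot_dendrogram() replaces the deprecated dendPlot())
igraph::plot_dendrogram(ceb, mode="hclust", main = names(aNNGraphs)[[ind]])
# Graph plot with the communities highlighted
plot(ceb, aNNGraphs[[ind]], vertex.size=1, vertex.label=NA, main = names(aNNGraphs)[[ind]])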
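The graph construction can be illustrated on a small made-up example. The sketch below uses four hypothetical series, of which A and B move together and C and D move together, thresholds their correlation matrix in the same way, and (as a simplification of the community detection above) lists the connected components.

```r
library(igraph)

# Made-up series: B follows A closely, D follows C closely
vTrend <- 1:8
vZigZag <- rep( c(1, -1), times = 4 )
matToy <- cbind(
  A = vTrend,
  B = vTrend + 0.1 * vZigZag,
  C = vZigZag,
  D = vZigZag + 0.05 * vTrend
)

# Correlations between the columns, without the self-correlations
matCor <- cor(matToy)
diag(matCor) <- 0

# Keep only the strong correlations as weighted edges
matCor[ matCor <= 0.94 ] <- 0
grToy <- igraph::graph_from_adjacency_matrix( matCor, weighted = TRUE, mode = "undirected" )

# Connected components: A and B end up in one, C and D in the other
igraph::components(grToy)$membership
```

With a threshold this high the graph tends to fall apart into small groups of near-identical series, which is why the clusters are easy to read off.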
In this section, for each date we sum the values over all regions within each geo type × transportation type pair, make the corresponding time series, and plot them.
Remark: In the plots the Sundays are indicated with orange dashed lines.
Here we make the time series:
aDateStringToDateObject <- unique( dfAppleMobilityLongForm[, c("Date", "DateObject")] )
aDateStringToDateObject <- setNames( aDateStringToDateObject$DateObject, aDateStringToDateObject$Date )
aDateStringToDateObject <- as.POSIXct(aDateStringToDateObject)
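aTSDirReqByCountry <- purrr::map( aMatDateRegion, function(m) rowSums(m) )
# Align the series by date before combining them: they can differ in length, and cbind() would recycle the shorter ones
lsAllDates <- sort( unique( unlist( purrr::map( aTSDirReqByCountry, names ) ) ) )
matTS <- do.call( cbind, purrr::map( aTSDirReqByCountry, function(v) setNames( v[lsAllDates], lsAllDates ) ) )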
zooObj <- zoo::zoo( x = matTS, as.POSIXct(rownames(matTS)) )
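Because cbind recycles shorter vectors, series that cover different sets of dates are best aligned by date before they are combined. A minimal sketch with made-up named vectors (vA, vB, and matAligned are hypothetical names):

```r
# Two made-up daily series; the second one has no value for 2020-01-14
vA <- c( "2020-01-13" = 100, "2020-01-14" = 104, "2020-01-15" = 98 )
vB <- c( "2020-01-13" = 100, "2020-01-15" = 97 )

# All dates that occur in either series, in chronological order
lsAllDates <- sort( union( names(vA), names(vB) ) )

# Index each series by the common dates; missing dates become NA
matAligned <- cbind( A = vA[lsAllDates], B = vB[lsAllDates] )
rownames(matAligned) <- lsAllDates
matAligned
```

Sorting the date strings works here because the ISO format "YYYY-MM-DD" sorts chronologically.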
Here we plot them:
autoplot(zooObj) +
aes(colour = NULL, linetype = NULL) +
facet_grid(Series ~ ., scales = "free_y") +
geom_vline( xintercept = aDateStringToDateObject[weekdays(aDateStringToDateObject) == "Sunday"], color = "orange", linetype = "dashed", size = 0.3 )
Observation: In the time series plots the Sundays are indicated with orange dashed lines. From Monday to Thursday the volumes are lower than on, say, Fridays and Saturdays, which suggests that people are more familiar with their weekday trips. On Sundays people are (on average) also more familiar with their trips, or they simply travel less.
Here we do a “forecast” for code-workflow demonstration purposes; the forecasts should not be taken seriously.
Fit a time series model to the time series:
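# Fit an ARIMA model to each series. (Note that the next assignment overwrites this result.)
aTSModels <- purrr::map( names(zooObj), function(x) { forecast::auto.arima( zoo( x = zooObj[,x], order.by = index(zooObj) ) ) } )
# Forecast each series with forecast::forecast() applied directly to its values
aTSModels <- purrr::map( names(zooObj), function(x) forecast::forecast( as.matrix(zooObj)[,x] ) )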
names(aTSModels) <- names(zooObj)
Plot data and forecast:
lsPlots <- purrr::map( names(aTSModels), function(x) autoplot(aTSModels[[x]]) + ylab("Volume") + ggtitle(x) )
names(lsPlots) <- names(aTSModels)
do.call( gridExtra::grid.arrange, lsPlots )
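As a self-contained sketch of a fit-then-forecast workflow with the two functions used above, here they are applied to a simulated daily series with a weekly pattern. The series, its weekly frequency, and the 14-day horizon are assumptions made for the illustration, not choices taken from the analysis above.

```r
library(forecast)

set.seed(1)
# Simulated daily series: a weekly pattern plus noise (8 weeks)
vWeekPattern <- c(0, 2, 3, 4, 8, 10, -5)
vToy <- 100 + rep( vWeekPattern, times = 8 ) + rnorm( 56, sd = 1 )
tsToy <- stats::ts( vToy, frequency = 7 )

# Fit an ARIMA model and forecast two weeks ahead
fitToy <- forecast::auto.arima(tsToy)
fcToy <- forecast::forecast( fitToy, h = 14 )

# Point forecasts and the lower bounds of the default 80% and 95% prediction intervals
fcToy$mean
fcToy$lower
```

Declaring the weekly frequency lets the model selection consider seasonal terms, which matters for data with day-of-week effects such as these.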
[APPL1] Apple Inc., Mobility Trends Reports, (2020), apple.com.
[AA1] Anton Antonov, “Apple mobility trends data visualization”, (2020), SystemModeling at GitHub.
[AA2] Anton Antonov, “NY Times COVID-19 data visualization”, (2020), SystemModeling at GitHub.